OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Journal Article · Journal of Parallel and Distributed Computing
[1]; [2]; [3]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  3. Georgia Inst. of Technology, Atlanta, GA (United States)

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed-memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically, 3D grids and meshes). For a planar graph with $$n$$ vertices, our algorithm reduces communication volume asymptotically in $$n$$ by a factor of $$\mathscr{O}\big(\sqrt{\log n}\big)$$ and latency by a factor of $$\mathscr{O}(\log n)$$. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by a factor of $$\mathscr{O}\big(n^{1/3}\big)$$. In all cases, the memory overhead needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST. Our new 3D code achieves empirical speedups of up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm to heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (Halo) algorithm for co-processor offload. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups of up to 24× for planar graphs and up to 3.5× for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
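
To make the idea of a three-dimensional process grid combined with elimination tree parallelism more concrete, here is a minimal Python sketch. It is not taken from SuperLU_DIST or from the paper. The names `Grid3D` and `layers_for_node` are hypothetical, and so are the rank ordering, the complete binary elimination tree, and the power-of-two number of layers. The sketch also assumes that independent subtrees are assigned to separate 2D layers while their shared ancestors are kept on every layer below them, which is one plausible reading of "trades increased memory for reduced per-process communication" rather than something the abstract states.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Grid3D:
    """Hypothetical px x py x pz process grid (the rank layout is an assumption)."""
    px: int
    py: int
    pz: int

    @property
    def size(self) -> int:
        return self.px * self.py * self.pz

    def coords(self, rank: int) -> tuple[int, int, int]:
        """Map a linear rank to (x, y, z), with z the slowest-varying index."""
        if not 0 <= rank < self.size:
            raise ValueError(f"rank {rank} is outside a grid of size {self.size}")
        z, within_layer = divmod(rank, self.px * self.py)
        y, x = divmod(within_layer, self.px)
        return x, y, z


def layers_for_node(level: int, index: int, pz: int) -> range:
    """Return the z-layers assumed to hold node `index` at depth `level`.

    The elimination tree is assumed to be a complete binary tree with the
    root at level 0, and pz is assumed to be a power of two. A node above
    depth log2(pz) is held by every layer that owns one of its descendants;
    deeper nodes live entirely on a single layer.
    """
    if pz < 1 or pz & (pz - 1):
        raise ValueError("pz must be a power of two")
    if level < 0 or not 0 <= index < (1 << level):
        raise ValueError("index must lie in [0, 2**level)")
    split_depth = pz.bit_length() - 1  # log2(pz)
    if level >= split_depth:
        # The ancestor at split_depth decides which single layer owns the node.
        layer = index >> (level - split_depth)
        return range(layer, layer + 1)
    width = pz >> level  # number of layers sharing this ancestor
    return range(index * width, (index + 1) * width)


if __name__ == "__main__":
    grid = Grid3D(px=2, py=3, pz=4)
    print("rank 7 ->", grid.coords(7))
    for level, index in [(0, 0), (1, 1), (2, 3), (3, 5)]:
        layers = list(layers_for_node(level, index, grid.pz))
        print(f"tree node (level={level}, index={index}) -> layers {layers}")
```

Under these assumptions, each 2D layer can work on its own subtree without communicating with the other layers, and only the replicated ancestors require communication along the z dimension.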

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1559632
Alternate ID(s):
OSTI ID: 1547464
Journal Information:
Journal of Parallel and Distributed Computing, Vol. 131, Issue 9; ISSN 0743-7315
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science

Cited By (1)

Preparing sparse solvers for exascale computing
  • Anzt, Hartwig; Boman, Erik; Falgout, Rob
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 378, Issue 2166 https://doi.org/10.1098/rsta.2019.0053
journal January 2020